Nadaraya-Watson regression
Towards a Relationship-Aware Transformer for Tabular Data
Konstantinov, Andrei V., Zuev, Valerii A., Utkin, Lev V.
Deep learning models for tabular data typically do not allow for imposing a graph of external dependencies between samples, which can be useful for accounting for relatedness in tasks such as treatment effect estimation. Graph neural networks only consider adjacent nodes, making them difficult to apply to sparse graphs. This paper proposes several solutions based on a modified attention mechanism, which accounts for possible relationships between data points by adding a term to the attention matrix. Our models are compared with each other and with gradient boosting decision trees in a regression task on synthetic and real-world datasets, as well as in a treatment effect estimation task on the IHDP dataset.
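The idea of adding a relationship term to the attention matrix can be illustrated with a minimal sketch. The abstract does not say where or how the term enters, so this example assumes it is added to the pre-softmax logits of ordinary scaled dot-product attention; the function name `relation_biased_attention` and the parameters `relation` and `strength` are hypothetical and are not taken from the paper.

```python
import numpy as np

def relation_biased_attention(q, k, v, relation, strength=1.0):
    """Scaled dot-product attention with an additive relationship term.

    q: (n, d) queries, k: (m, d) keys, v: (m, p) values.
    relation: (n, m) matrix encoding known links between samples
              (assumption: larger value = stronger relationship).
    strength: scalar weight of the relationship term (assumption).
    Returns the attended values (n, p) and the attention weights (n, m).
    """
    d = q.shape[-1]
    # Standard attention logits plus the relationship term.
    logits = q @ k.T / np.sqrt(d) + strength * relation
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With `strength=0` the sketch reduces to plain attention, while a large `strength` makes each sample attend almost exclusively to the samples it is related to.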
Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Zheng, Chuanyang, Sun, Jiankai, Gao, Yihang, Xie, Enze, Wang, Yuehao, Wang, Peihao, Xu, Ting, Chang, Matthew, Ren, Liliang, Li, Jingyao, Xiong, Jing, Rasul, Kashif, Schwager, Mac, Schneider, Anderson, Wang, Zhangyang, Nevmyvaka, Yuriy
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a design choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both the feed-forward neural network (FFN) and MoE can be interpreted as special cases of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the zero-additional-cost Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in the $\mathrm{KERN}$ router function. Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function $\mathrm{KERN}$.
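For readers unfamiliar with the classical estimator, the following is a minimal sketch of Nadaraya-Watson regression with a Gaussian kernel. It is not the paper's KERN router; it only illustrates why normalized kernel weights look like a softmax-weighted aggregation of outputs. The function name and the `bandwidth` parameter are illustrative choices.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Nadaraya-Watson estimate with a Gaussian kernel.

    x_query: (m, d) query points, x_train: (n, d) training points,
    y_train: (n,) training targets. Returns an array of shape (m,).
    """
    # Squared Euclidean distance between every query and training point.
    diff = x_query[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Log of the (unnormalized) Gaussian kernel values.
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    # Normalizing the kernel values over the training points is exactly
    # a softmax over the logits (computed here in a stable way).
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # The prediction is a weighted average of the training targets.
    return weights @ y_train
```

Reading `y_train` as expert outputs and `weights` as router scores gives the structural analogy with MoE that the abstract refers to; the exact correspondence used in the paper may differ from this sketch.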
A Note on Doubly Robust Estimator in Regression Continuity Designs
This note introduces a doubly robust (DR) estimator for regression discontinuity (RD) designs. RD designs provide a quasi-experimental framework for estimating treatment effects, where treatment assignment depends on whether a running variable surpasses a predefined cutoff. A common approach in RD estimation is the use of nonparametric regression methods, such as local linear regression. However, the validity of these methods still relies on the consistency of the nonparametric estimators. In this study, we propose the DR-RD estimator, which combines two distinct estimators for the conditional expected outcomes. The primary advantage of the DR-RD estimator lies in its ability to ensure the consistency of the treatment effect estimation as long as at least one of the two estimators is consistent. Consequently, our DR-RD estimator enhances the robustness of treatment effect estimators in RD designs.
Flexible conditional density estimation for time series
Grivol, Gustavo, Izbicki, Rafael, Okuno, Alex A., Stern, Rafael B.
This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.
BENK: The Beran Estimator with Neural Kernels for Estimating the Heterogeneous Treatment Effect
Kirpichenko, Stanislav R., Utkin, Lev V., Konstantinov, Andrei V.
A method called BENK (the Beran Estimator with Neural Kernels) is proposed for estimating the conditional average treatment effect from censored time-to-event data. The main idea behind the method is to apply the Beran estimator to estimate the survival functions of controls and treatments. Instead of the typical kernel functions in the Beran estimator, it is proposed to implement kernels as neural networks of a specific form, called neural kernels. The conditional average treatment effect is estimated by using the survival functions as outcomes of the control and treatment neural networks, which consist of a set of neural kernels with shared parameters. The neural kernels are more flexible and can accurately model a complex location structure of feature vectors. Various numerical simulation experiments illustrate BENK and compare it with the well-known T-learner, S-learner and X-learner for several types of control and treatment outcome functions based on the Cox models, the random survival forest and the Nadaraya-Watson regression with Gaussian kernels. The code of the proposed algorithms implementing BENK is available at https://github.com/Stasychbr/BENK.
LARF: Two-level Attention-based Random Forests with a Mixture of Contamination Models
Konstantinov, Andrei V., Utkin, Lev V.
New attention-based random forest models called LARF (Leaf Attention-based Random Forest) are proposed. The first idea behind the models is to introduce a two-level attention, where one level is the "leaf" attention, in which the attention mechanism is applied to every leaf of the trees. The second level is the tree attention, which depends on the "leaf" attention. The second idea is to replace the softmax operation in the attention with a weighted sum of softmax operations with different parameters. It is implemented by applying a mixture of Huber's contamination models and can be regarded as an analog of multi-head attention with "heads" defined by selecting a value of the softmax parameter. Attention parameters are trained simply by solving a quadratic optimization problem. To simplify the tuning process of the models, it is proposed to make the contamination parameters trainable and to compute them by solving the quadratic optimization problem. Many numerical experiments with real datasets are performed to study LARFs. The code of the proposed algorithms can be found at https://github.com/andruekonst/leaf-attention-forest.
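The replacement of a single softmax by a weighted sum of softmax operations with different parameters can be sketched as follows. This is only an illustration of that one sentence, not the LARF implementation: it assumes the "softmax parameter" is a temperature, and the names `softmax_mixture`, `temperatures` and `mix_weights` are hypothetical. In the paper the mixture weights are obtained by quadratic optimization, which is not reproduced here.

```python
import numpy as np

def softmax(scores, temperature):
    """Softmax of a 1-D score vector with a temperature (assumed parameter)."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_mixture(scores, temperatures, mix_weights):
    """Weighted sum of softmax 'heads', one head per temperature.

    mix_weights must be non-negative and sum to one, so the result
    is again a valid vector of attention weights.
    """
    mix_weights = np.asarray(mix_weights, dtype=float)
    if np.any(mix_weights < 0) or not np.isclose(mix_weights.sum(), 1.0):
        raise ValueError("mix_weights must lie on the probability simplex")
    # One softmax per temperature, stacked as rows: (heads, n_scores).
    heads = np.stack([softmax(scores, tau) for tau in temperatures])
    return mix_weights @ heads
```

Because the result is linear in `mix_weights`, fitting those weights with a squared-error loss leads to a quadratic problem, which is consistent with the training procedure the abstract describes.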
Heterogeneous Treatment Effect with Trained Kernels of the Nadaraya-Watson Regression
Konstantinov, Andrei V., Kirpichenko, Stanislav R., Utkin, Lev V.
Finding an efficient treatment for a patient given her/his clinical and other characteristics [1, 2] can be regarded as an important goal of personalized medicine. This goal can be pursued by means of machine learning methods owing to the increasing amount of available electronic health records, which are a basis for developing accurate models. To estimate the treatment effect, patients are divided into two groups, called treatment and control, and patients from the two groups are then compared. One popular measure of treatment efficiency used in machine learning models is the average treatment effect (ATE) [3], which is estimated from observed patient data as the mean difference between outcomes of patients from the treatment and control groups. Because patients differ in their characteristics and in their responses to a particular treatment, the treatment effect is measured by the conditional average treatment effect (CATE), or heterogeneous treatment effect (HTE), defined as the ATE conditional on a patient feature vector [4, 5, 6, 7]. Two main problems can be pointed out when CATE is estimated. The first is that the control group is usually larger than the treatment group. As a result, we face the problem of a small training dataset, which does not allow us to directly apply many efficient machine learning methods.
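The two quantities defined above can be made concrete with a small sketch. The ATE function follows the mean-difference definition in the text. The CATE function is only one possible illustration: it assumes that separate Nadaraya-Watson regressions with a fixed Gaussian kernel are fitted to the treatment and control groups and then subtracted, which is not the trained-kernel method of the paper. All function names and the `bandwidth` parameter are hypothetical.

```python
import numpy as np

def ate_mean_difference(y, t):
    """ATE as the mean outcome of treated (t == 1) minus controls (t == 0)."""
    y, t = np.asarray(y, dtype=float), np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()

def _kernel_regress(x_query, x_train, y_train, bandwidth):
    """Nadaraya-Watson regression with a fixed Gaussian kernel."""
    sq_dist = np.sum((x_query[:, None, :] - x_train[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist / (2.0 * bandwidth ** 2)
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train

def cate_two_regressions(x_query, x, y, t, bandwidth=0.5):
    """CATE sketch: difference of group-wise outcome regressions at x_query.

    x: (n, d) features, y: (n,) outcomes, t: (n,) binary treatment flags.
    """
    x = np.asarray(x, dtype=float)
    y, t = np.asarray(y, dtype=float), np.asarray(t)
    # Estimated outcome under treatment and under control for each query.
    mu1 = _kernel_regress(x_query, x[t == 1], y[t == 1], bandwidth)
    mu0 = _kernel_regress(x_query, x[t == 0], y[t == 0], bandwidth)
    return mu1 - mu0
```

When the treatment group is much smaller than the control group, `mu1` is estimated from few points and becomes noisy, which is the small-sample problem the paragraph describes.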